Resampling Methods

Advanced Analytics with R (UG 21-24)

Ayush Patel

Before we start

Please load the following packages

library(tidyverse)
library(MASS)
library(ISLR)
library(ISLR2)
library(nnet)  # install this package if you don't have it
library(e1071) # install this package if you don't have it



Access the lecture slides at bit.ly/aar-ug

[Image: Warrior's armor (gusoku). Source: Armor (Gusoku)]

Hello

I am Ayush.

I am a researcher working at the intersection of data, law, development and economics.

I teach Data Science using R at the Gokhale Institute of Politics and Economics.

I am an RStudio (Posit) certified tidyverse Instructor.

I am a Researcher at the Oxford Poverty and Human Development Initiative (OPHI), at the University of Oxford.

Reach me

ayush.ap58@gmail.com

ayush.patel@gipe.ac.in

Learning Objective

To better understand model assessment and selection.

We will learn resampling methods to help us achieve this objective.

Resampling Methods

Essentially, we fit a model repeatedly using different subsets of the training data. This helps us assess:

  • the accuracy of estimated parameters,
  • the performance of a model,
  • estimates of test error rates.

We will learn about Cross-Validation and Bootstrap

Cross Validation

We know that:

  • Usually, test sets are not available.
  • Training error rates typically understate test error rates.

So we use some techniques to estimate the test error rate.

We will carve out a subset from the training data. This held-out subset will not be used in the fitting process. Instead, we use it to evaluate the fitted model.

Cross Validation - Validation Set Approach

  • Split the data into parts. Training set and validation set.
  • Fit model using the training set.
  • Use the fitted model to predict the responses for the validation-set observations.
  • Calculate validation set error rate.
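These steps can be sketched in R using the Auto data loaded above (a minimal illustration; the seed and the 50/50 split are arbitrary choices):

```r
# Validation set approach on the Auto data (assumes ISLR2 is loaded)
set.seed(1)                                    # arbitrary seed for reproducibility
train_idx <- sample(nrow(Auto), nrow(Auto)/2)  # 1. split the data 50/50
train <- Auto[train_idx, ]
valid <- Auto[-train_idx, ]

fit <- lm(mpg ~ horsepower, data = train)      # 2. fit on the training set only

preds <- predict(fit, newdata = valid)         # 3. predict on the validation set
mean((valid$mpg - preds)^2)                    # 4. validation set MSE
```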

Cross Validation - Validation Set Approach

Recall

lm(mpg ~ horsepower,
   data = Auto)|>summary()

Call:
lm(formula = mpg ~ horsepower, data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.5710  -3.2592  -0.3435   2.7630  16.9240 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 39.935861   0.717499   55.66   <2e-16 ***
horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.906 on 390 degrees of freedom
Multiple R-squared:  0.6059,    Adjusted R-squared:  0.6049 
F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16
lm(mpg ~ horsepower + poly(horsepower,2),
   data = Auto)|>summary()

Call:
lm(formula = mpg ~ horsepower + poly(horsepower, 2), data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.7135  -2.5943  -0.0859   2.2868  15.8961 

Coefficients: (1 not defined because of singularities)
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          39.935861   0.639714   62.43   <2e-16 ***
horsepower           -0.157845   0.005747  -27.47   <2e-16 ***
poly(horsepower, 2)1        NA         NA      NA       NA    
poly(horsepower, 2)2 44.089528   4.373921   10.08   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.374 on 389 degrees of freedom
Multiple R-squared:  0.6876,    Adjusted R-squared:  0.686 
F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16

Cross Validation - Validation Set Approach

  • What if we keep increasing the power of the polynomial?
  • Do we get consistent error rates?

Cross Validation - Validation Set Approach

calc_validation_mse <- function(pow) {
  # randomly mark roughly half the rows as training
  Auto_marked <- Auto|>
    mutate(
      set_train = sample(c(1, 0), nrow(Auto),
                         replace = TRUE,
                         prob = c(0.5, 0.5))
    )

  Auto_train <- Auto_marked[Auto_marked$set_train == 1, ]
  Auto_valid <- Auto_marked[Auto_marked$set_train == 0, ]

  # fit on the training half only
  mod <- lm(mpg ~ poly(horsepower, pow),
            data = Auto_train)

  tibble(
    power = pow,
    validation_mse = mean((Auto_valid$mpg - predict(mod, Auto_valid))^2)
  )
}


map_dfr(
  1:10,
  calc_validation_mse
)|>
  ggplot(aes(power, validation_mse)) +
  geom_point(colour = "steelblue") +
  geom_line(colour = "red") +
  scale_x_continuous(breaks = 1:10) +
  theme_bw() +
  labs(
    y = "MSE",
    x = "Degree of polynomial"
  ) -> plot_powers

plot_powers

Cross Validation - Validation Set Approach

[Plot: validation MSE by degree of polynomial]

Cross Validation - Validation Set Approach

[Plot: validation MSE by degree of polynomial]

Problems with the Validation Set Approach

  • The validation estimate of the test error rate is highly variable.
  • Since fewer observations are available for training, statistical methods tend to perform worse. This leads to an overestimate of the test error rate.
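The variability is easy to see by repeating the random split: each call below redraws the training/validation split for the degree-2 model and returns a different validation MSE (a sketch reusing `calc_validation_mse()` defined above; the seed is arbitrary):

```r
# Ten random splits, ten different validation MSE estimates
set.seed(42)  # arbitrary seed
replicate(10, calc_validation_mse(2)$validation_mse)
```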

Exercise

Use the penguins data from the palmerpenguins package.

Split the data into 50/50 training and validation sets.

Use species as the response. Train a multinomial logistic regression to predict species.

Write code that records the validation error rate.

Iterate this 100 times and plot a density chart of the validation error rates.
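One possible skeleton for this exercise, using `multinom()` from the nnet package loaded earlier (the predictors and variable names here are only suggestions, not the required answer):

```r
library(palmerpenguins)

one_run <- function() {
  dat <- tidyr::drop_na(penguins)                 # drop rows with missing values
  idx <- sample(nrow(dat), nrow(dat)/2)           # 50/50 split
  fit <- nnet::multinom(species ~ bill_length_mm + flipper_length_mm,
                        data = dat[idx, ], trace = FALSE)
  preds <- predict(fit, newdata = dat[-idx, ])
  mean(preds != dat[-idx, ]$species)              # validation error rate
}

errs <- map_dbl(1:100, \(i) one_run())

tibble(error = errs)|>
  ggplot(aes(error)) +
  geom_density() +
  theme_bw()
```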

Leave-One-Out Cross Validation

  • Leave out the first observation from the data.
  • Use the remaining observations to train a model.
  • Use the left-out observation to calculate the validation MSE.
  • Repeat this process until every observation has been left out once, and calculate the average validation MSE.

Less bias compared to the validation set approach, so it does not overestimate the test error rate.

Since there is no randomness in the training/validation split, the estimate does not vary between runs.
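As an illustration, LOOCV for the linear model fitted earlier can be computed with `cv.glm()` from the boot package (an assumption here: boot is not in the package list at the start, so install/load it first):

```r
library(boot)  # provides cv.glm(); assumed available

# fit with glm() (no family argument = linear model) so cv.glm() can be used
glm_fit <- glm(mpg ~ horsepower, data = Auto)

loocv <- cv.glm(Auto, glm_fit)  # default K = n, i.e. leave-one-out
loocv$delta[1]                  # LOOCV estimate of the test MSE
```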